Abstract

Shi and Tsai (JRSSB, 2002) proposed an interesting residual information criterion
(RIC) for model selection in regression. Their RIC was motivated by the principle
of minimizing the Kullback-Leibler discrepancy between the residual likelihoods of the
true and candidate models. We show, however, that under this principle RIC would always
choose the full (saturated) model. The residual likelihood is therefore not appropriate
as a discrepancy measure for defining an information criterion. We explain why this is so
and provide a corrected residual information criterion as a remedy.

KEY WORDS: Residual information criterion; Corrected residual information criterion.

1 Introduction 

Given n iid observations from a true model 

$$y = X\beta_0 + \varepsilon,$$

where $y = (y_1, \ldots, y_n)'$, $X$ is an $n \times p$ design matrix, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ follows a multivariate
distribution with mean $0$ and variance $\sigma_0^2 W(\theta_0)$, and $\beta_0 \in \mathbb{R}^p$ is an unknown vector to
be estimated. Here $\theta_0$ is an $m \times 1$ vector parameterizing the correlation matrix. Finally, we
denote $\mathcal{A}_0 = \mathcal{A}(\beta_0) = \{j : \beta_{0j} \neq 0,\ j = 1, \ldots, p\}$ as the nonzero coefficient set and $k_0 = \#\mathcal{A}_0$
as the number of nonzero coefficients. The problem of estimating $\mathcal{A}_0$ is often referred to as
variable selection or model selection.
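To make the notation concrete, the following minimal sketch simulates data from the model in the simplest case $W = I$ with normal errors and extracts the nonzero coefficient set. The coefficient vector, sample size and function names are hypothetical choices made for illustration only.

```python
import random

def active_set(beta):
    """Return the 1-based indices j with beta_j != 0, i.e. the set A(beta)."""
    return {j for j, b in enumerate(beta, start=1) if b != 0}

def simulate(X, beta0, sigma0, rng):
    """Draw y = X beta0 + eps with independent N(0, sigma0^2) errors (assumes W = I)."""
    return [sum(x * b for x, b in zip(row, beta0)) + rng.gauss(0.0, sigma0) for row in X]

rng = random.Random(0)
n, beta0 = 20, [2.0, 0.0, -1.5, 0.0, 0.0]      # hypothetical true coefficients, p = 5
X = [[rng.gauss(0.0, 1.0) for _ in beta0] for _ in range(n)]
y = simulate(X, beta0, sigma0=1.0, rng=rng)

A0 = active_set(beta0)   # nonzero coefficient set
k0 = len(A0)             # number of nonzero coefficients
print(sorted(A0), k0)    # prints [1, 3] 2
```

Variable selection then amounts to recovering `A0` from `X` and `y` alone.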

Variable selection in linear regression is probably one of the most important problems
in statistics; see, for example, the references in Shao (1997). To automate the process of
choosing a finite-dimensional candidate model out of all possible models, various information
criteria have been developed. All of these criteria share two basic elements: one
term that measures the goodness of fit and another that penalizes the complexity
of the fitted model, usually taken as a function of the number of parameters used. Generally
speaking, the existing variable selection approaches can be classified into two broad categories. On
the one hand, AIC-type criteria, such as AIC (Akaike, 1970) and AICc (Hurvich and Tsai,
1989), seek to minimize the Kullback-Leibler divergence between the true and candidate
models. On the other hand, BIC-type criteria (Schwarz, 1978) are used to identify a
candidate model so as to achieve selection consistency. These criteria are motivated
by different assumptions and different considerations, both practical and theoretical. Any
particular choice of which one to use depends on the context and is open to
criticism, as each has its own merits and shortcomings.

In an important paper, Shi and Tsai (2002) proposed an interesting information criterion
termed the residual information criterion (RIC). The authors motivated RIC
by minimizing the discrepancy between the residual log-likelihood functions
of the true and candidate models. Surprisingly, however, they arrived at a BIC-type
criterion, in marked contrast with other information criteria, such as AIC and
AICc, that are motivated by the same principle of minimizing the Kullback-Leibler discrepancy.

In this paper, we show that the RIC approach does not in fact minimize the
Kullback-Leibler discrepancy between residual likelihoods. We provide a corrected criterion,
RIC*, motivated by this principle. However, we show that if the residual likelihoods are
used to evaluate the Kullback-Leibler divergence between models, RIC (i.e. RIC*) would
always choose the full model. Therefore, the residual likelihood is not an appropriate loss
function for defining an information criterion. We provide a simple likelihood-based approach
to circumvent the problem.

The rest of the paper is organized as follows. Section 2 reviews the RIC method of
Shi and Tsai. Since Shi and Tsai's RIC does not approximate the Kullback-Leibler divergence,
we provide the RIC* measure as a correction. However, RIC* always chooses the
full model, and we explain why. Section 3 presents the corrected residual
information criterion, motivated by minimizing the Kullback-Leibler divergence between
likelihoods instead of residual likelihoods. Concluding remarks are given in Section 4.

2 The Residual Information Criterion 

We review the RIC method of Shi and Tsai (2002) in this section. The model we consider
in this article is a special case of that in Shi and Tsai (2002), obtained by assuming that the
Box-Cox transformation parameter $\lambda$ is 1. The results in the paper can be easily extended to
Box-Cox models by following similar arguments in Shi and Tsai.

We start by looking at a candidate (working) model 

$$y = X\beta + \varepsilon,$$

such that $\#\mathcal{A}(\beta) = k$. We denote the active covariates in $X$ as $X_{\mathcal{A}}$. Inspired by the residual
likelihood method in Harville (1974) or Diggle et al. (1994) for obtaining an unbiased estimator
of the error variance, we can write the residual log-likelihood as

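$$
\begin{aligned}
L(\theta, \sigma^2) = {} & -\tfrac{1}{2}(n-k)\log(2\pi) + \tfrac{1}{2}\log|X_{\mathcal{A}}'X_{\mathcal{A}}| - \tfrac{1}{2}(n-k)\log(\sigma^2) - \tfrac{1}{2}\log|W| \\
& - \tfrac{1}{2}\log|X_{\mathcal{A}}'W^{-1}X_{\mathcal{A}}| - \tfrac{1}{2}\,y'(W^{-1} - H_{\mathcal{A}})y/\sigma^2,
\end{aligned} \tag{1}
$$

where $H_{\mathcal{A}} = W^{-1}X_{\mathcal{A}}(X_{\mathcal{A}}'W^{-1}X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'W^{-1}$ and the dependence of $W$ on $\theta$ is suppressed.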
A useful measure of the distance between the working model and the true model is the 
Kullback-Leibler divergence 

$$d(\theta, \sigma^2) = E_0[-2L(\theta, \sigma^2) + 2L_0(\theta_0, \sigma_0^2)], \tag{2}$$

where $E_0$ denotes the expectation under the true model and $L_0$ denotes the residual
log-likelihood of the true model. The best model loses the least information, in terms of
Kullback-Leibler distance, relative to the truth and is therefore preferred. Such a criterion
formulates RIC in an information-theoretic framework. Provided that one can estimate
$d(\theta, \sigma^2)$ without bias, this criterion provides a sound basis for parameter estimation and
statistical inference under appropriate conditions.

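Since $E_0[2L_0(\theta_0, \sigma_0^2)]$ is independent of the working model, we just need to evaluate
$E_0[-2L(\theta, \sigma^2)]$. In Shi and Tsai (2002), it is written as

$$d(\theta, \sigma^2) = E_0\big[(n-k)\log(\sigma^2) + \log|W| + \log|X_{\mathcal{A}}'W^{-1}X_{\mathcal{A}}| + y'(W^{-1} - H_{\mathcal{A}})y/\sigma^2\big] \tag{3}$$

$$\phantom{d(\theta, \sigma^2)} = (n-k)\log(\sigma^2) + \log|W| + \log|X_{\mathcal{A}}'W^{-1}X_{\mathcal{A}}| + E_0(X\beta_0 + \varepsilon)'(W^{-1} - H_{\mathcal{A}})(X\beta_0 + \varepsilon)/\sigma^2 \tag{4}$$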

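by omitting irrelevant terms. Substituting the estimated values $\hat\theta$ and $\hat\sigma^2$ into (3), we have

$$
\begin{aligned}
d(\hat\theta, \hat\sigma^2) = {} & (n-k)\log(\hat\sigma^2) + \log|\hat W| + \log|X_{\mathcal{A}}'\hat W^{-1}X_{\mathcal{A}}| \\
& + (X\beta_0)'(\hat W^{-1} - \hat H_{\mathcal{A}})(X\beta_0)/\hat\sigma^2 + \sigma_0^2\,\mathrm{tr}\{(\hat W^{-1} - \hat H_{\mathcal{A}})W_0\}/\hat\sigma^2,
\end{aligned} \tag{5}
$$

where $W_0 = W(\theta_0)$. The above expression involves the unknown quantity $\sigma_0^2$. Following Shi and Tsai, we judge
the quality of the candidate model by $E_0\{d(\hat\theta, \hat\sigma^2)\}$. Now, if we assume $\mathcal{A}_0 \subset \mathcal{A}$, an
assumption also used in deriving AICc (Hurvich and Tsai, 1989), the term involving $X\beta_0$ becomes
zero. Furthermore, if we assume $\hat\theta$ is consistent for $\theta_0$, we can estimate $W_0$ by $\hat W$ since
$\hat W = W_0 + o_p(1)$. Then the trace term can be approximated as $(n-k)\sigma_0^2/\hat\sigma^2$. Since
$\mathcal{A}_0 \subset \mathcal{A}$, $(n-k)\hat\sigma^2/\sigma_0^2$ follows a $\chi^2_{n-k}$ distribution and therefore

$$E_0[(n-k)\sigma_0^2/\hat\sigma^2] = (n-k)^2/(n-k-2).$$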
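The expectation above rests on the mean of an inverse chi-squared variable. As a short check, the derivation below spells out that step; the symbols $U$ and $m$ are introduced here for illustration only and do not appear in Shi and Tsai's notation.

```latex
% U ~ chi^2_m with m = n - k > 2
\begin{aligned}
E[U^{-1}]
  &= \int_0^\infty u^{-1}\,\frac{u^{m/2-1}e^{-u/2}}{2^{m/2}\,\Gamma(m/2)}\,du
   = \frac{2^{m/2-1}\,\Gamma(m/2-1)}{2^{m/2}\,\Gamma(m/2)}
   = \frac{1}{m-2}, \\[4pt]
E_0\!\left[\frac{(n-k)\sigma_0^2}{\hat\sigma^2}\right]
  &= (n-k)^2\,E[U^{-1}]
   = \frac{(n-k)^2}{n-k-2},
  \qquad U = \frac{(n-k)\hat\sigma^2}{\sigma_0^2}.
\end{aligned}
```

The second equality in the first line uses $\Gamma(m/2) = (m/2 - 1)\,\Gamma(m/2 - 1)$, which is why $n - k > 2$ is needed.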

Finally, Shi and Tsai argued that $\log|X_{\mathcal{A}}'\hat W^{-1}X_{\mathcal{A}}|$ can be approximated by $k\log(n)$. Putting
everything together, they proposed the residual information criterion as follows

$$\mathrm{RIC} = (n-k)\log(\hat\sigma^2) + \log|\hat W| + k\log(n) - k + \frac{4}{n-k-2}, \tag{6}$$

after removing the constant $n + 2$. Asymptotically, the complexity part of RIC is of the
order $k\log(n)$. Comparing with $\mathrm{BIC} = n\log(\tilde\sigma^2) + k\log(n)$, where $\tilde\sigma^2$ is the MLE of $\sigma_0^2$, it
is intuitively clear that Shi and Tsai's RIC yields consistent model selection, as BIC does. The
complexity penalty of RIC, however, is fundamentally different from that of other familiar
information criteria such as AIC and AICc, which are designed to approximate the Kullback-Leibler
divergence between two models. This observation raises the question of whether RIC
correctly approximates the divergence.
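As an illustration of how RIC and BIC compare in practice, the sketch below evaluates both from a residual sum of squares. It assumes $W = I$ by default, so that $\hat\sigma^2 = \mathrm{RSS}/(n-k)$ and $\tilde\sigma^2 = \mathrm{RSS}/n$; the function names and the `log_det_w` argument are hypothetical and are not taken from Shi and Tsai.

```python
import math

def ric(rss, n, k, log_det_w=0.0):
    """RIC as in (6). Assumes sigma_hat^2 = RSS / (n - k); log_det_w is log|W_hat| (0 if W = I)."""
    sigma2_hat = rss / (n - k)
    return (n - k) * math.log(sigma2_hat) + log_det_w + k * math.log(n) - k + 4.0 / (n - k - 2)

def bic(rss, n, k):
    """BIC = n log(sigma_tilde^2) + k log(n), with the MLE sigma_tilde^2 = RSS / n."""
    return n * math.log(rss / n) + k * math.log(n)

def constant_removed(n, k):
    """(n - k)^2 / (n - k - 2) minus the constant n + 2; this should equal -k + 4 / (n - k - 2)."""
    return (n - k) ** 2 / (n - k - 2) - (n + 2)

n, k = 50, 3
print(abs(constant_removed(n, k) - (-k + 4.0 / (n - k - 2))) < 1e-9)   # prints True
print(ric(47.0, n, k), bic(47.0, n, k))
```

Both criteria carry the same leading penalty $k\log(n)$; they differ in the degrees of freedom used for the variance estimate and in the small-sample correction.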

It turns out that Shi and Tsai's derivation, motivated by minimizing the Kullback-Leibler
distance, is incorrect in at least two important places:

1. In (3), a model-dependent term $\log|X_{\mathcal{A}}'X_{\mathcal{A}}|$ from (1) is omitted, which causes serious
bias in deriving an information criterion. In fact, following Shi and Tsai's arguments,
we can approximate $\log|X_{\mathcal{A}}'X_{\mathcal{A}}|$ by $k\log(n)$ and thus RIC should have been

$$\mathrm{RIC}^* = (n-k)\log(\hat\sigma^2) + \log|\hat W| - k + \frac{4}{n-k-2}.$$

Note that in this formulation, RIC* always chooses the full model.

2. Even more seriously, the practice of approximating the Kullback-Leibler distance between
residual likelihoods for comparing models is fundamentally wrong. To illustrate, suppose
that $W = I$. In this simple case, the residual likelihood becomes, up to the term in $\log(2\pi)$,

$$L(\sigma^2) = -\tfrac{1}{2}(n-k)\log(\sigma^2) - \tfrac{1}{2}\,y'(I - H_{\mathcal{A}})y/\sigma^2, \qquad H_{\mathcal{A}} = X_{\mathcal{A}}(X_{\mathcal{A}}'X_{\mathcal{A}})^{-1}X_{\mathcal{A}}'.$$

We see immediately that $E_0[-2L(\sigma^2)] = (n-k)\log(\sigma^2) + (n-k)\sigma_0^2/\sigma^2$ whenever $\mathcal{A}_0 \subset \mathcal{A}$.
Thus, for candidate models that include $X_{\mathcal{A}_0}$ in the covariate set, $E_0[-2L(\sigma^2)]$ is
always minimized by $\sigma^2 = \sigma_0^2$ and in this case $E_0[-2L(\sigma^2)] = (n-k)(\log(\sigma_0^2) + 1)$.
Therefore, if one knows the exact data generating process, the ideal RIC leads to the
full model, as its $E_0[-2L(\sigma^2)]$ is the smallest. This explains why RIC* always chooses
the full model.
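The population-level argument in item 2 can be checked numerically. The sketch below evaluates $E_0[-2L(\sigma^2)]$ for nested candidates that contain the true model, under the assumed values $W = I$ and $\sigma_0^2 = 1$; the sample size, number of covariates and function name are hypothetical.

```python
import math

def population_residual_criterion(n, k, sigma2, sigma0_sq):
    """E_0[-2 L(sigma^2)] for W = I and a candidate of size k that contains the true model."""
    return (n - k) * math.log(sigma2) + (n - k) * sigma0_sq / sigma2

n, p, k0, sigma0_sq = 100, 10, 3, 1.0   # hypothetical setting; sigma0^2 = 1 is an assumption

# Evaluate every candidate size k0, ..., p at the minimizing value sigma^2 = sigma0^2.
values = {k: population_residual_criterion(n, k, sigma0_sq, sigma0_sq) for k in range(k0, p + 1)}
best_k = min(values, key=values.get)
print(best_k)   # prints 10, the full model
```

In this setting the criterion equals $n - k$ for every candidate containing the true model, so it keeps decreasing as covariates are added and never singles out $k_0$.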

Given the above serious flaws in going from an unbiased estimator of the Kullback-Leibler
divergence to RIC, Shi and Tsai's RIC in (6) seems improperly motivated. Fortunately,
Shi and Tsai's derivation can be corrected, and we introduce a corrected RIC in the
next section.

3 A Corrected Residual Information Criterion 

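Instead of using the residual likelihood, a justifiable criterion is to use the log-likelihood which,
up to an additive constant, is

$$L(\beta, \theta, \sigma^2) = -\tfrac{1}{2}\big\{n\log(\sigma^2) + \log|W| + (y - X\beta)'W^{-1}(y - X\beta)/\sigma^2\big\}$$

in defining the divergence

$$d(\beta, \theta, \sigma^2) = E_0[-2L(\beta, \theta, \sigma^2) + 2L_0(\beta_0, \theta_0, \sigma_0^2)].$$

We can write

$$E_0[-2L(\beta, \theta, \sigma^2)] = E_0\big[n\log(\sigma^2) + \log|W| + (X\beta_0 + \varepsilon - X\beta)'W^{-1}(X\beta_0 + \varepsilon - X\beta)/\sigma^2\big]$$

$$\phantom{E_0[-2L(\beta, \theta, \sigma^2)]} = n\log(\sigma^2) + \log|W| + n\sigma_0^2/\sigma^2 + (X\beta - X\beta_0)'W^{-1}(X\beta - X\beta_0)/\sigma^2,$$

where the term $n\sigma_0^2/\sigma^2$ is obtained by taking $W = W_0$.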
We can now replace $\sigma^2$, $\beta$ and $\theta$ by their estimates obtained from the residual likelihood
method. Now, suppose that $\mathcal{A}_0 \subset \mathcal{A}$. Following Shi and Tsai again,
$E_0[n\sigma_0^2/\hat\sigma^2] \approx n(n-k)/(n-k-2)$. Since $\hat\beta - \beta_0$ asymptotically follows the normal
distribution $N\{0, \sigma_0^2(X_{\mathcal{A}}'W^{-1}X_{\mathcal{A}})^{-1}\}$,

$$\frac{1}{k}(X\hat\beta - X\beta_0)'W^{-1}(X\hat\beta - X\beta_0)/\hat\sigma^2$$

is distributed approximately as $F(k, n-k)$. Therefore,

$$E_0\big\{(X\hat\beta - X\beta_0)'W^{-1}(X\hat\beta - X\beta_0)/\hat\sigma^2\big\} \approx \frac{k(n-k)}{n-k-2}.$$
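The $F$-based expectation can be sanity-checked by simulation. The sketch below is a minimal Monte Carlo check, assuming the quadratic form divided by $\sigma_0^2$ is exactly $\chi^2_k$ and is independent of $(n-k)\hat\sigma^2/\sigma_0^2 \sim \chi^2_{n-k}$; the function name, sample size and number of draws are hypothetical.

```python
import random

def mean_scaled_quadratic_form(n, k, draws, seed=0):
    """Monte Carlo mean of Q / sigma_hat^2, where Q / sigma0^2 ~ chi2_k and
    (n - k) * sigma_hat^2 / sigma0^2 ~ chi2_{n-k} are drawn independently."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        q = rng.gammavariate(k / 2.0, 2.0)                    # chi-squared with k d.f.
        s = rng.gammavariate((n - k) / 2.0, 2.0) / (n - k)    # sigma_hat^2 / sigma0^2
        total += q / s
    return total / draws

n, k = 30, 3
estimate = mean_scaled_quadratic_form(n, k, draws=200_000)
exact = k * (n - k) / (n - k - 2)    # 3 * 27 / 25 = 3.24
print(estimate, exact)
```

A chi-squared variable with $m$ degrees of freedom is a gamma variable with shape $m/2$ and scale 2, which is what `random.gammavariate` draws here. The simulated mean should lie close to $k(n-k)/(n-k-2)$.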



Putting everything together, we have the following corrected residual information criterion,
which we shall refer to as RICc,

$$\mathrm{RICc} = n\log(\hat\sigma^2) + \log|\hat W| + k + \frac{4(k+1)}{n-k-2},$$

by omitting a constant $n + 2$. Note that

$$\mathrm{AIC} = n\log(\tilde\sigma^2) + 2k,$$

and

$$\mathrm{AICc} = n\log(\tilde\sigma^2) + 2n(k+1)/(n-k-2),$$

where $\tilde\sigma^2$ is the MLE of $\sigma_0^2$. We can decompose the first term of RICc, AIC and AICc as
$n\log(\mathrm{RSS}) - n\log(n-k)$, $n\log(\mathrm{RSS}) - n\log(n)$ and $n\log(\mathrm{RSS}) - n\log(n)$ respectively. Thus,
the complexity penalties for RICc, AIC and AICc are $-n\log(n-k) + k + 4(k+1)/(n-k-2)$,
$-n\log(n) + 2k$ and $-n\log(n) + 2n(k+1)/(n-k-2)$ respectively. It can be seen that RICc
has a larger penalty than AIC and a smaller penalty than AICc when $n \gg k$.
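The ordering of the three penalties can be verified numerically. The following sketch evaluates the penalty expressions stated above for an illustrative sample size; the function names and the choice $n = 100$ are hypothetical.

```python
import math

def penalty_ricc(n, k):
    """Complexity penalty of RICc: -n log(n - k) + k + 4(k + 1)/(n - k - 2)."""
    return -n * math.log(n - k) + k + 4.0 * (k + 1) / (n - k - 2)

def penalty_aic(n, k):
    """Complexity penalty of AIC: -n log(n) + 2k."""
    return -n * math.log(n) + 2.0 * k

def penalty_aicc(n, k):
    """Complexity penalty of AICc: -n log(n) + 2n(k + 1)/(n - k - 2)."""
    return -n * math.log(n) + 2.0 * n * (k + 1) / (n - k - 2)

n = 100
for k in (1, 5, 10):
    # Columns: k, AIC penalty, RICc penalty, AICc penalty.
    print(k, round(penalty_aic(n, k), 2), round(penalty_ricc(n, k), 2), round(penalty_aicc(n, k), 2))
```

For these values the RICc penalty falls strictly between the AIC and AICc penalties, in line with the comparison in the text.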



4 Concluding Remarks 

In fitting a model to data, one is required to choose a set of candidate models, a fitting 
procedure and a criterion to compare competing models. A minimal requirement for a 
reasonable criterion is that the population version of the criterion is uniquely minimized 
by the set of the parameters which generate the data. The population version of the 
residual likelihood information criterion is minimized by the full model and thus fails to meet 
this basic requirement. Therefore, the residual likelihood cannot be used as a discrepancy 
measure between models. A simple remedy is to use the likelihood based Kullback-Leibler 
divergence. 

Although Shi and Tsai's RIC is a legitimate criterion in its own right, our arguments show that it
is not motivated by the right principle. Had one followed their motivation, RIC
(i.e. RIC* in our notation) would always have chosen the full model. However, Shi and
Tsai's RIC, though motivated by the wrong principle (using the residual likelihood instead
of the likelihood) and omitting the important term $\log|X_{\mathcal{A}}'X_{\mathcal{A}}|$ in the approximation,
has good small-sample performance in their simulations. Additionally, Shi and Tsai's RIC
has been successfully applied in a number of settings, such as normal linear regression,
Box-Cox transformation, inverse regression models (Ni et al., 2005) and longitudinal data
analysis (Li et al., 2006). This success may be explained by the resemblance of Shi and Tsai's RIC
to BIC. Despite its increasing popularity, Shi and Tsai's RIC remains without a proper motivation.
Finding a justification for it is a topic for future research.